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Abstract 

Graphics processing units (GPUs) are recently being used to an increasing degree for general 
computational purposes. This development is motivated by their theoretical peak performance, 
which significantly exceeds that of broadly available CPUs. For practical purposes, however, it 
is far from clear how much of this theoretical performance can be realized in actual scientific 
applications. As is discussed here for the case of studying classical spin models of statistical 
mechanics by Monte Carlo simulations, only an explicit tailoring of the involved algorithms to 
the specific architecture under consideration allows to harvest the computational power of GPU 
systems. A number of examples, ranging from Metropolis simulations of ferromagnetic Ising 
models, over continuous Heisenberg and disordered spin-glass systems to parallel-tempering 
simulations are discussed. Significant speed-ups by factors of up to 1000 compared to serial 
CPU code as well as previous GPU implementations are observed. 

Keywords: Monte Carlo simulations, Graphics processing units, Ising model, Heisenberg 
model. Spin glasses. Parallel tempering 



1. Introduction 

Numerical techniques and, in particular, computer simulations are by now firmly established 
as a third pillar of the scientific method, complementing the pillars of experiment (or observation) 
and mathematical theory, both of which were erected already at the birth of modern science in 
the Scientific Revolution. While in the early times [1] simulation studies were not quite compet- 
itive with analytical calculations in condensed matter physics and quantum field theory (usually 
based on perturbative and variational approaches) nor were their outcomes adequate for com- 
parison with experimental results (usually due to limited length or time scales), this situation 
has dramatically changed over the past decades. This very successful race to catch up was fu- 
eled by a combination of two factors: a continuous, sometimes revolutionary refinement of the 
numerical toolbox, for instance through the invention of cluster algorithms [2], reweighting [3] 
or generalized-ensemble techniques [4, 5] in the field of Monte Carlo simulations, and the im- 
pressive increase in generally available computational power, which has followed an exponential 
form known as Moore's law for the past forty years. At any time, however, there has been no 
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shortage of fascinating scientific problems whose computational study required resources at or 
beyond the cutting edge of the available technology. This has led scientists to regularly use the 
latest commodity hardware as soon as it became available, but has also motivated the design and 
construction of a number of special purpose machines such as, e.g., the cluster processor [6] and 
Janus [7] for the simulation of spin systems or a series of initiatives for QCD calculations with 
its recent addition of the QPACE architecture [8]. 

It has been true for the last couple of generations of graphics processing units (GPUs) that 
their theoretical peak performance, in particular for single-precision floating point operations 
(FLOP/s), significantly exceeds that of the corresponding x86 based CPUs available at the same 
time (as of this writing up to around 100 GFLOP/s in CPUs vs. up to 5 TFLOP/s in GPUs). It is 
therefore natural that scientists and, increasingly, also programmers of commercial applications 
other than computer games, have recently started to investigate the possible value of GPU based 
computing for their purposes; for scientific applications see, e.g., Refs. [9-12]. Apart from their 
mere peak performance, systems based on GPUs or other highly parallel architectures such as 
the Cell processor [8] might outperform current CPU based systems also in terms of their effi- 
ciency, i.e., in terms of FLOP/s per Watt or FLOP/s per Euro and thus might also contribute to 
the advancement of "green computing" ' . The low prices and convenient over-the-counter avail- 
ability of GPU devices clearly make for advantages as compared to custom-built special-purpose 
machines, for which many man-months or years need to be invested for the design, realization 
and programming of devices. The relative increase in peak performance of GPUs versus CPUs 
is achieved at the expense of flexibility, however: while today's CPUs will perform pretty well 
on most of a large variety of computer codes and, in particular, in a multi-tasking environment 
where on-the-fly optimizations such as branch prediction are essential, GPUs are optimized for 
the highly vectorized and parallelized floating-point computations typical in computer graphics 
appUcations. A one-to-one translation of a typical code written for CPUs to GPUs will therefore, 
most often, not run faster and, instead, algorithms and paralleUzation and vectorization schemes 
must be chosen very carefully to tailor for the GPU architecture in order to harvest any perfor- 
mance increases. The efficiency of such calculations in terms of FLOP/s per human time cru- 
cially depends on the availability of easily accessible programming environments for the devices 
at hand. While in view of a lack of such supporting schemes early attempts of general purpose 
GPU (GPGPU) calculations still needed to encapsulate the computational entities of interest in 
OpenGL primitives [9], the situation changed dramatically with the advent of device- specific 
intermediate GPGPU language extensions such as ATI Stream and NVIDIA CUDA [14]. In the 
future, one hopes to become independent of specific devices with general parallel programming 
environments such as OpenCL [15]. 

Classical spin systems have turned out to be extremely versatile models for a host of phenom- 
ena in statistical, condensed matter and high-energy physics, with appUcations ranging from crit- 
ical phenomena [16] over surface physics [17] to QCD [18]. A rather significant effort, therefore, 
has been invested over the years into optimized implementations of Monte Carlo simulations of 
spin models. They hence appear to be particularly suited for an attempt to fathom the potential 
gain from switching to the GPU architecture. Additionally, there are a plethora of questions 
relating to classical spin models which, decades of research notwithstanding, are still awaiting 
a combination of an increase in available computational resources, better algorithms and clever 
techniques of data analysis to find generally satisfactory answers. This applies, in particular. 
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to disordered systems such as random-field models and spin glasses [191. As will be discussed 
below, due to their short-ranged interactions and the typically simple lattice geometries such 
models appear to be near ideal problems for GPU calculations. This applies, in particular, to the 
inherently local single spin-flip algorithms discussed here. For studying the critical behavior of 
models without disorder, cluster algorithms will outperform any optimized implementation of a 
local spin-flip dynamics already for intermediate lattice sizes; the possibilities for porting such 
approaches to GPU will be discussed elsewhere [20]. For disordered systems, on the other hand, 
eflficient cluster algorithms are (with few exceptions [21]) not known. For them, instead, local 
spin-flip simulations combined with parallel tempering [5] moves are the state of the art. 

When comparing GPU and CPU performance for the case of general purpose computational 
applications, it has become customary to benchmark different implementations in terms of the 
relative speed-up (or slow-down) of the GPU code versus the CPU implementation [11, 14, 22]. 
While such numbers make for good selling points for the company producing the "better" type 
of computational device, it should be clear that speed-ups, being a relative measure, will vary 
to a large degree depending on how much effort is invested in the optimization of the codes 
for the different architectures. To avoid this problem, the main focus is put here on the abso- 
lute performance of different implementations, measured for the present application mostly in 
the wall-clock time for performing a single spin flip, citing speed-up factors only as an addi- 
tional illustration of the relative performance. If relative measures of performance are given, the 
question arises of what type of CPU code to compare to, in particular, since with the advent of 
multi-core processors and vector extensions such as SSE, CPUs also offer a significant poten- 
tial for optimizations. I decided here to use serial CPU code, reasonably optimized on the level 
of a high-level programming language and the use of good compilers, as I feel that this comes 
closest to what is most often being used in actual simulation studies. As regards the possible 
effects of further CPU optimizations, see also Ref. [22] which, however, unfortunately does not 
cite any measures of absolute performance. Simulations of the ferromagnetic Ising model us- 
ing implementations on GPU have been discussed before [9, 11, 23-25]. Compared to these 
implementations, the current approach with the double checkerboard decomposition and multi- 
hit updates to be discussed below offers significant advantages. Other applications, such as the 
simulation of ferromagnetic, short-range Heisenberg models, the simulation of Ising spin glasses 
with asynchronous multi-spin coding or parallel tempering for lattice spin systems are consid- 
ered here for the first time. For some very recent discussions of further spin models see also 
Refs. [26, 27]. 

The rest of this article is organized as foUows. In Sec. 2, 1 give some necessary background 
on the architecture of the NVIDIA GPUs used in this study and its consequences for algorithm 
design. Section 3 discusses, in some detail, the chosen implementation of a single-spin flip 
MetropoUs simulation of the two-dimensional (2D) Ising model and its performance as weU as 
a generaUzation to the three-dimensional (3D) model. Section 4 is devoted to generalizations of 
these considerations to continuous-spin systems, exemplified in the 2D Heisenberg model. In 
Sees. 5 and 6, appUcations to simulations of spin-glass models and the use of parallel-tempering 
updates are discussed. Finally, Sec. 7 contains my conclusions. 

2. The NVIDIA architecture 

As indicated above, there are significant differences in the general design of CPU and GPU 
units [28]. CPUs have been optimized over the past decades to be "smart" devices for the 
rather unpredictable computational tasks encountered in general-purpose computing. Current 
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Figure 1: Schematic representation of the architecture of current GPUs. The terminology is borrowed from that used for 
NVIDIA boards. 



Intel CPUs feature about 800 million transistors on one die. A rather small fraction of them 
is used for the ALUs (arithmetic logic units) doing actual computations, whereas most of the 
available transistors are devoted to flow control (such as out-of-order execution and branch pre- 
diction) and cache memory. This structure is very successful in increasing the performance of 
single-threaded applications, which stifl make up the vast majority of general computer codes. 
In contrast, about 80 % of the 1.4 billion transistors on a GPU die of the NVIDIA GT200 series 
(now superseded by the Fermi architecture) are ALUs. GPUs do not try to be "clever", but they 
are extremely efficient at doing the same computational steps on different bits of a large data-set 
in parallel. This is what makes them interesting for applications in scientific computing. 

Figure 1 shows a schematic representation of the general architecture of a current GPU. The 
chip contains a number of multiprocessors each composed of a number of parallel processing 
units. The systems based on the GT200 architecture used in this study feature 30 multiproces- 
sors of 8 processors each (versus 15 multiprocessors a 32 cores for the GTX 480 Fermi card). The 
systems come with a hierarchy of memory layers with different characteristics. Each processor is 
equipped with a number of registers which can be accessed with essentially no latency. The pro- 
cessors in a multiprocessor unit have read/write access to a small amount of shared memory (16 
KB in the GT200 architecture and 48 KB for Fermi cards) which is on-chip and can be accessed 
with latencies around a hundred times smaller than those for global memory. The large device or 
global memory (up to 6 GB in current Tesla systems) is connected with read/write access to all 
multiprocessors. Latency for global memory accesses is of the order of 400-600 clock cycles (as 
compared to, e.g., one clock cycle for a multiply or add operation). The additional constant and 
texture memory areas are cached and designed for read-only access from the processing units, in 
which case they operate very fast with small latencies. The memory of the host computer cannot 
be accessed directly from within calculations on the GPU, such that all relevant data need to be 
copied into and out of the GPU device before and after the calculation, respectively. Since the 
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Figure 2: Parallel execution of a GPU program ("kernel") in a grid of thread blocks. Threads within a block work 
synchronously on the same data set. Different blocks are scheduled for execution independent of each other 

processing units of each multiprocessor are designed to perform identical calculations on dif- 
ferent parts of a data set, flow control for this SIMD (single instruction multiple data) type of 
parallel computations is rather simple. It is clear that this type of architecture is near ideal for the 
type of calculations typical for computer graphics, namely rendering a large number of triangles 
in a 3D scene or the large number of pixels in a 2D projection in parallel. 

The organization of processing units and memory outlined in Fig. 1 translates into a combi- 
nation of two types of parallelism: the processing units inside of each multiprocessor work syn- 
chronously on the same data set (vectorization), whereas different multiprocessors work truly in- 
dependent of each other (parallelization). The corresponding programming model implemented 
in the CUDA framework [14] is outlined schematically in Fig. 2: computations on GPU are 
encapsulated in functions (called "kernels") which are compiled to the GPU instruction set and 
downloaded to the device. They are executed in a two-level hierarchic set of parallel instances 
("execution configuration") called a "grid" of thread "blocks". Each block can be thought of as 
being executed on a single multiprocessor unit. Its threads (up to 512 for the GT200 architec- 
ture and 1024 for Fermi cards) access the same bank of shared memory concurrently. Ideally, 
each thread should execute exactly the same instructions, that is, branching points in the code 
should be reduced to a minimum. The blocks of a grid (up to 65536 x 65536) are scheduled in- 
dependently of each other and can only communicate via global memory accesses. The threads 
within a block can make use of cheap synchronization barriers and communicate via the use of 
shared (or global) memory, avoiding race conditions via atomic operations implemented directly 
in hardware. On the contrary, the independent blocks of a grid cannot eff'ectively communicate 
within a single kernel call. If synchronization between blocks is required, consecutive calls of 
the same kernel are required, since termination of a kernel call enforces all block computations 
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and pending memory writes to complete. 

Since the latency of global memory accesses is huge as compared to the expense of ele- 
mentary arithmetic operations, many computational tasks on GPU will be limited by memory 
accesses, i.e., the bandwidth of the memory subsystem. It is therefore crucial for achieving good 
performance to (a) increase the number of arithmetic operations per memory access, and (b) 
optimize memory accesses by using shared memory and choosing appropriate memory access 
patterns. The latter requirement in particular includes the adherence to the specific alignment 
conditions for the different types of memory and clever use of the coalescence phenomenon, 
which means that accesses to adjacent memory locations issued by the different threads in a 
block under certain conditions can be merged into a single request, thus greatly improving mem- 
ory bandwidth. Due to most typical computations being limited by memory accesses, it is im- 
portant to use an execution configuration with a total number of threads (typically several 1000) 
much larger than the total number of processing units (240 for the Tesla CI 060 and 480 for the 
GTX 480). If a thread block issues an access, e.g., to global memory, the CPU's scheduler will 
suspend it for the number of cycles it takes to complete the memory accesses and, instead, ex- 
ecute another block of threads which has already finished reading or writing its data. The good 
performance of these devices thus rests on the idea of latency hiding and transparent scalability 
through flexible thread scheduling [29]. 

3. Metropolis simulations of the ferromagnetic Ising model 

The layout of the CPU architecture outlined above implies guidelines for the efficient imple- 
mentation of computer simulation codes. Along these lines, a code for single-spin flip Metropolis 
simulations of the ferromagnetic Ising model is developed. 

3.1. General considerations 

We consider a ferromagnetic, nearest-neighbor, zero-field Ising model with Hamiltonian 

9i=-jY,SiSj, Si = ±\ (1) 

<U> 

on square and simple cubic lattices of edge length L, using periodic boundary conditions. In 
the single-spin flip Metropolis simulation scheme for this model each step consists of randomly 
selecting a spin i g {!,..., = L^} and proposing a spin flip s, -s„ which is accepted 
according to the Metropolis criterion [1] 

Pacc(^;^-^0=miii[l,e-^^^], (2) 

where hEjJ = 2s, Y^j nn ; corresponds to the energy change incurred by the flip and fi - l/ksT 
denotes the inverse temperature. It is straightforward to show that this update is ergodic, i.e., 
all states of the system can be reached in a finite number of spin-flip attempts with non-zero 
probability, and satisfies the detailed balance condition ensuring convergence to the equilibrium 
Boltzmann distribution. In practice, one usually walks through the lattice in a sequential fashion 
instead of picking spins at random, which requires less pseudo-random numbers and, generically, 
leads to somewhat faster convergence. This updating scheme, in fact, violates detailed balance, 
but is consistent with the global balance condition which is sufficient to ensure convergence [30]. 
This point is of some importance for the GPU implementation developed here, and will be further 
discussed below in Section 3.4. 
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Figure 3: Double checkerboard decomposition of a square lattice of edge length L = 32 for peii'orming a single spin-flip 
Metropolis simulation of the Ising model on GPU. Each of the B X B = 4 X 4 big tiles is assigned as a thread block to a 
multiprocessor, whose individual processors work on one of the two sub-lattices of all T X T = 8 X 8 sites of the tile in 
parallel. 

3.2. Double checkerboard decomposition 

To leverage the effect of the massively parallel architecture of GPU devices for simulations of 
spin models, in most cases domain decompositions where the system is divided into a large num- 
ber of largely independent sub-units are the only approach with satisfactory scaling properties as 
the number of processors or the size of the system is increased. For the case of lattice systems, 
the simplest scheme amounts to a coloring of lattice sites with the minimally feasible number of 
colors. Here, I focus on bipartite graphs such as the square and (hyper-)cubic lattices where two 
colors suffice, resulting in a (generalized) checkerboard decomposition [31]. Generalizations to 
other cases are trivial. 

In such a scheme, each site of one sub-lattice can be updated independently of all others 
(assuming nearest-neighbor interactions only), such that all of them can be treated in parallel fol- 
lowed by an analogous procedure for the second sub-lattice. For an implementation on GPU, the 
checkerboard decomposition needs to be mapped onto the structure of grid blocks and threads. 
Although, conceptually, a single thread block suffices to update one sub-lattice in parallel, the 
limitation to 512 threads per block for the GT200 architecture (resp. 1024 threads for Fermi) 
enforces an additional block structure for systems of more than 1024 resp. 2048 spins. For the 
square lattice, in Ref. [11] stripes with a height of two lattice spacings were assigned to inde- 
pendent thread blocks while performing all spin updates in global memory. This has several 
disadvantages, however: (a) shared memory with its much lower latency is not used at all, (b) 
determining the indices of the four neighboring spins of each site requires costly conditionals due 
to the periodic boundary conditions, and (c) the system size is limited to L < 1024 for GT200 
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(L < 2048 for Fermi). Here, instead, I suggest a more versatile and efficient approach leveraging 
the inherent power of the memory hierarchy intrinsic to GPUs. To this end a twofold hierarchy 
of checkerboard decompositions is used. Figure 3 illustrates this for the case of a square lattice: 
on the first level, the lattice is divided into B X B big tiles in a checkerboard fashion. These 
are then updated by individual thread blocks, where the individual threads exploit a second-level 
checkerboard decomposition of the T x T lattice sites inside each tile. The size of big tiles is 
thereby chosen such that the spin configuration on the tile (plus a boundary layer of spins) fits 
into shared memory, such that the spin updates can then be performed entirely inside of this 
much faster memory area. On the block level, through the checkerboard arrangement all tiles of 
one ("even") sub-lattice can be updated concurrently before updating the other ("odd") half of 
tiles. Inside of each block, again all sites of the finer sub-lattices are independent of each other 
and are therefore updated concurrently by the threads of a thread block. 
In summary, the updating procedure looks as follows: 

1. The updating kernel is launched with B^/2 thread blocks assigned to treat the even tiles of 
the coarse checkerboard. 

2. The threads of each thread block cooperatively load the spin configuration of their 
tile plus a boundary layer into shared memory. 

3. The threads of each block perform a Metropolis update of each even lattice site in their tile 
in parallel. 

4. All threads of a block wait for the others to finish at a barrier synchronization point. 

5. The threads of each block perform a Metropolis update of each odd lattice site in their tile 
in parallel. 

6. The threads of each block are again synchronized. 

7. A second kernel is launched working on the odd tiles of the coarse checkerboard in 

the same fashion as for the even tiles. 

The cooperative loading of each tile into shared memory is organized in a fashion to ensme 
coalescence of global memory accesses [29], i.e., subsequent threads in each block access con- 
secutive global memory locations wherever possible. While it turns out to be beneficial to load 
tiles into shared memory already for a single update per spin before treating the second coarse 
sub-lattice due to an avoidance of bank conflicts in global memory as well as the suppressed 
need to check for periodic wrap-arounds in determining lattice neighbors, the ratio of arithmetic 
calculations to global memory accesses is still not very favorable. This changes, however, if a 
multi-hit technique is applied, where the spins on each of the two coarse sub-lattices are updated 
several times before returning to the other half of the tiles. As is discussed below in Sec. 3.4, this 
works well, in general, and only close to criticality somewhat increased autocorrelation times are 
incurred. 

3.3. Random number generation 

By design, Monte Carlo simulations of spin systems require a large amount of pseudo- 
random numbers. Depending on implementation details, for a system as simple as the Ising 
model, the time required for generating random numbers can dominate the total computational 
cost. Hence, the efficient implementation of random-number generators (RNGs) in a massively 
parallel environment is a crucial requirement for the efficient use of the GPU architecture. Speed 
is not everything, however, and it is well known that the statistical deficiencies that no pseudo 
RNG can entirely avoid can have rather dramatic consequences in terms of highly significant 
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deviations of simulation results from the correct answers [32, 33]. Hence, some effort must be 
invested in the choice and implementation of an appropriate RNG. 

From the highly parallel setup described above, it is clear that each of the N = L?/4 concur- 
rent threads must be able to generate its stream of random numbers to a large degree indepen- 
dently of all others, since the bottleneck in any configuration with a centralized random-number 
production would severely impede scaling of the code to a large number of processing units. As 
each of these (sub-)sequences of random numbers are used in the same simulation, one needs 
to ensure that (a) either each thread generates different members of the same global sequence of 
numbers or (b) the sequences generated by different threads are at least statistically uncorrelated. 
The simplest choice of RNG is given by the linear congruential generator (LCG) of the form 

x„+\ = {ax„ + c)modOT (3) 

with m = 2^2. The authors of Ref. [11] used a = 1664525 and c = 1013904223, originally 
suggested in Ref. [34]. The period of this generator is small with p - 2^^ a; 4 x 10^ [35]. In 
a simulation of a 4096 x 4096 Ising system, for instance, this period is exhausted already after 
256 sweeps. On theoretical grounds, it is argued that one actually should not use more than ->Jp 
numbers out of such a sequence [35, 36], which would render the sequence of available numbers 
very short indeed. Additionally, simple LCGs are known to exhibit strong correlations which can 
be revealed by plotting A:-tuples of successive (normalized) numbers as points in R*, where it is 
found that, already for rather small k, the points are confined to a sequence of hyperplanes instead 
of being uniformly distributed. The choice m - 1?^ has the advantage that the whole calculation 
can be implemented entirely in 32-bit integer arithmetic since on most modem architectures 
(including GPUs) integer overflows are equivalent to taking a modulo operation. For such power 
of two moduli m, however, the period of the less significant bits is even shorter than that of the 
more significant bits, such that the period of the Ath least significant bit is only 2*. An advantage 
for the parallel calculations performed here is that one can easily skip ahead in the sequence, 
observing that 

Xn+t = (atXn + c,)modOT, (4) 

where 

t 

at = a' mod m, fcf = ^a'cmodw. (5) 

1=1 

Therefore, choosing t equal to the number = of threads, all threads can generate numbers 
out of the same global sequence (3) concurrently. An altemative setup, that cannot guarantee the 
desired independence of the sequences associated to individual RNG instances, however, starts 
from randomized initial seeds for each generator, without using any skip-ahead [11]. To improve 
on the drawback of a short period, one might consider moving to a generator with larger m, for 
instance m - 2^, where the modulo operation again can be implemented by overflows, this time 
for 64-bit integers, a data type which is also available in the CUDA framework. As multiplier I 
chose a = 2862933555777941757 with provably relatively good properties [37], where an odd 
ofl^set, here c = 1442695040888963407, needs to be chosen to reach the maximal period of 
p -2^ x2y. 10^^. As for the 32-bit case, this generator can be used in the parallel framework 
yielding numbers from a common sequence via skip-ahead, or as independent generators with 
initial randomized seeds, leading to overlapping sub-sequences. 

For high-precision calculations, one normally would not want to rely on a simple LCG. A 
significant number of generators with much better statistical properties has been developed in the 

9 



1.0 
0.9 
0.8 
0.7 
a 0.6 
'Z 0.5 
•I 0.4 
0.3 
0.2 
0.1 
0.0 



■ Fibonacci 

■ LCG64 

■ LCG32 



\ 

\\ 

\\ 

w 

\\ 
\\ 
w 



30 60 90 120 

# blocks 



150 



180 



Figure 4: Computer time per random number for running diflferent implementations of pseudo RNGs on a Tesla CI 060 
GPU as a function of the number of grid blocks employed. The "Fibonacci" generator corresponds to the recursion (7) 
with a = p = \ and r = 521, i = 353, while "LCG32" and "LCG64" correspond to the recursion (3) with m = 2^^ and 
m = 2^, respectively. All thread blocks use 512 threads. 



past, for a review see Ref. [36]. For our purposes, however, most of them have severe drawbacks 
in that they are either quite slow (such as, for instance, generators that combine a number of 
elementary RNGs or, instead, drop a certain fraction of random numbers generated as in RAN- 
LUX [38]), that they use a large state to operate on (for instance 624 32-bit words for the popular 
Mersenne twister [39]), or that it is rather hard to ensure that sequences generated independently 
are to a reasonable degree uncorrected (as for most high-quality generators). While a state larger 
than a few words is usually not a problem for serial implementations, in the framework of GPU 
computing where fast local (i.e., shared) memory is very limited, it is impossible to use several 
hundred words per thread of a thread block only for random number generation^. A reason- 
able compromise could be given by generators of the (generalized) lagged Fibonacci type with 
recursion 

x„ — ax„-r ffi bx„-s mod m, (6) 

which operate on a buffer of length r + s with s < r and have good properties for sufficiently 
large lags r and s and choosing © = ± [40]. If m = 2"', the maximal period is p - 2"'"' (2'' - 1). 
For an implementation in single precision arithmetic, i.e., w = 32, and r = 1279 (see below), this 
results in a rather large period p x 2x 10^^"*. The recursion (6) can be implemented directly in 
floating-point arithmetic yielding uniform random numbers m, e [0, 1 ] via 

u„ — aun^y + /3u„_s mod 1 . (7) 

For good quality one needs r > 100, leading to relatively large storage requirements, but here the 
generation of s random numbers can be vectorized by the threads of a thread block. Therefore, 



^Note that the "official" CUDA implementation of Mersenne twister [14] uses data structures entirely in global mem- 
ory. 
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Table 1 : Internal energy e per spin and specific heat Cy for a 1 024 X 1024 Ising model with periodic boundary conditions 
at yS = 0.4 from simulations on CPU and on GPU using dilferent random number generators. "LCG32" and "LCG64" 
correspond to the recursion (3) with m = 2^^ and m = 2^, respectively, while "Fibonacci" denotes the generator (7) with 
r = 521, s = 353 and r = 1279, s = 861, respectively. For the "random" versions of LCGs on GPU, the generators 
for each thread are seeded randomly, while in the other cases all threads sample from the same sequence using skipping 
according to Eq. (4). Ajgi denotes the deviation from the exact result relative to the estimated standard deviation. 



method 


e 


A,-el 


Cv 


A,-el 


exact 


1.106079207 





0.8616983594 





sequential update (CPU) 


LCG32 


1.1060788(15) 


-0.26 


0.83286(45) 


-63.45 


LCG64 


1.1060801(17) 


0.49 


0.86102(60) 


-1.14 


Fibonacci, r = 512 


1.1060789(17) 


-0.18 


0.86132(59) 


-0.64 


checkerboard update (GPU) 


LCG32 


1.0944121(14) 


-8259.05 


0.80316(48) - 


121.05 


LCG32, random 


1.1060775(18) 


-0.97 


0.86175(56) 


0.09 


LCG64 


1.1061058(19) 


13.72 


0.86179(67) 


0.14 


LCG64, random 


1.1060803(18) 


0.62 


0.86215(63) 


0.71 


Fibonacci, r = 512 


1.1060890(15) 


6.43 


0.86099(66) 


-1.09 


Fibonacci, r = 1279 


1.1060800(19) 


0.40 


0.86084(53) 


-1.64 



the required ring buffer of length r+s words is shared by the threads of a block, and hence 
the storage requirement in only 2(r+s)/T^ words per thread. If one chooses s only slightly larger 
than the number of threads per block, only a few words per thread are consumed. A number of 
good choices for the lags r and s are collected in Ref. [40]. The best performance is achieved for 
unity multipliers a - fi - 1, at only moderately reduced quality of the resulting random numbers. 
In this setup, different thread blocks use different sequences. Prescriptions for ensuring disjunct 
sequences are described in Ref. [40]. In view of the astronomic period, however, it appears safe 
to just seed the ring buffers of the generators for different blocks with an independent RNG. 

The LCG generator (3) with m = 2^^ and m = 2^ as well as the generalized Fibonacci 
generator (7) were implemented and benchmarked on a Tesla CI 060 GPU. The LCGs have been 
realized with 32- and 64-bit integers, respectively, using the multipliers listed above. For the 
Fibonacci generator, I chose a = = I and lags r = 521 and s = 353 (suitable for simulations 
with tile size up to T - 16). The generators were test with a variable number of blocks of 
512 threads each (where 512 is the maximum number of threads allowed on the CI 060 device). 
All generator state information is kept in shared memory. The results are displayed in Fig. 4. 
The characteristic zig-zag pattern results from the commensurability (or not) of the number of 
blocks b with the number of 30 multiprocessors on the device, such that the performance is best 
ifb = 30i, ( = 1,2,... and worst \fb- 30/' -H 1, since in the latter case 29 of the multiprocessors 
are idle once the last batch of blocks has been scheduled for execution. The 32-bit LCG is found 
to take at best 0.036 ns per random number, whereas the 64-bit LCG and the lagged Fibonacci 
generator are about 3.5 and 4.3 times slower, respectively. 

The quality of the streams of random numbers generated in this way was tested with the 
Metropolis update simulation of the 2D Ising model using the double checkerboard update. The 
resulting estimates for the internal energy and specific heat of a system of edge length L = 1024 
and inverse temperature jS = 0.4 from simulations employing a number of different generators 
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are collected in Tab. 1. For comparison, the exact results calculated according to Ref. [41] are 
shown as well. For the sequential updates on CPU, the dominant correlations and short period of 
the 32-bit LCG lead to highly significant deviations at the chosen rather high level of statistics 
(10^ updates per spin), which are, however, only visible in the specific heat data. Already the 
64-bit LCG and likewise the Fibonacci generator result in data completely consistent with the 
exact values at the given level of statistical accuracy. The double checkerboard simulations on 
GPU, on the other hand, appear to be much more sensitive to correlations in the RNGs. If 
all threads sample from the same sequence of random numbers using skipping according to 
Eq. (4), the 32-bit LCG produces astronomic and the 64-bit LCG still sizable deviations from the 
exact result, cf. the data collected in Tab 1 . Somewhat surprisingly, no significant deviations are 
produced from either LCG when seeding the LCG of each thread randomly and independently 
with another RNG. In some sense, this setup appears to mimic the effects of a combined generator 
[34]: although all of the sequences traversed by individual threads have significant overlap with 
each other, the random shifts between different sequences are sufficient to cancel the correlation 
effects. The Fibonacci generators of Eq. (7), on the other hand, do not suffer from such problems, 
although deviations are again larger than for the sequential implementation, and one needs to go 
to a lag as large as r = 1279 to get fully satisfactory results for the system at hand. Note that the 
observed differences between CPU and GPU results are entirely due to the different order of spin 
updates in the sequential and checkerboard setups, and have nothing to do with shortcomings 
of the GPU architecture. In particular, all the generators (and variants) implemented here on 
GPU in connection with the corresponding simulation codes guarantee completely reproducible 
results, i.e., the sequence of generated spin configurations only depends on the initial seeding of 
the RNGs and not the order in which different blocks or threads are executed. 

For general practical applications, the use of the lagged Fibonacci generator appears to be 
a reasonable compromise in that it is only rather insignificantly slower than the 64-bit LCG, 
but produces random numbers of substantially better quahty. The setup with randomly seeded 
independent LCGs per thread also yields satisfactory results (and can be very fast), but in the 
absence of a deeper understanding of how correlations are destroyed in this setup, it cannot 
be entirely recommended for high-precision studies of simple, discrete models such as the Ising 
ferromagnet considered here. For systems with continuous degrees of freedom or with additional 
quenched disorder, however, such correlation effects are presumably less of a problem. 

3.4. Balance and autocorrelations 

It is worthwhile to note that the checkerboard update, like the sequential update of lattice 
sites usually applied in a scalar computing context, does not satisfy detailed balance (see, e.g., 
Ref. [30]), 

T({s]} ^ {si})pp{{s\}) = T{{si} ^ {s',})pp{\si}\ (8) 

on the level of single spin flips since the probability of selecting a spin twice for updating in 
direct succession vanishes if the lattice sites are treated in a deterministic order. The global 
balance condition, 

2 n{s',} ^ {Si})pp{{s'i]) - Yj ^ {^;})/'^({*'}) = (9) 

on the other hand, which is sufficient to ensure convergence of the chain [42], stays satisfied at all 
times. The same holds true ff the checkerboard update is applied in a multi-hit fashion with each 
coarse sub-lattice being updated k>\ times before moving on to the other sub-lattice. Although 
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Figure 5: Integrated autocon'elation times of the internal energy for simulations of an 128 X 128 Ising model as a function 
of inverse temperature /J. The GPU calculations were performed with different numbers k of multi-hit updates, but at a 
constant total number of updates. 



detailed balance is not required for sampling the equilibrium distribution, it can be recovered, if 
desired, e.g., to ensure that certain dynamical properties hold, on the level of compound moves 
such as lattice sweeps. Denote, for instance, an update of the even sub-lattice with "A" and an 
update of the odd sub-lattice with "B" and a measurement time by "M". Then, a sequence of the 
form 

AAAA{M)AAAABBBB(M)BBBBAAAA(M)AAAABBBB(M)BBBB ■ ■ ■ 

satisfies detailed balance with respect to compound updates and is hence more symmetric than 
the naive multi-hit implementation. 

While the multi-hit approach with A: > 1 is a perfectly valid updating scheme, it is clear that 
very close to the critical point it will be somewhat slower than the k - I version at decorrelating 
the system as soon as the correlation length ^ exceeds the size T of the tiles. Assuming that 
^ > r, it takes 

Ttile ~ 

sweeps to decorrelate the configuration of the tile or, equivalently, for information traveling dif- 
fusively to cross the tile. For k » Ttiie, one expects multi-hit updates to become ineflicient. 
Figure 5 shows the resulting integrated autocorrelation times [43, 44] rint of the internal energy 
of a 128 X 128 Ising system as a function of inverse temperature /3 and comparing the CPU cal- 
culation with GPU simulations with multi-hit updates ranging from k - 1 up to = 100, using 
tiles of T X T = 16 x 16 spins. Note that, in the same way as sequential spin updates in a 
scalar simulation have different autocorrelation times than purely random updates, the checker- 
board (i.e., GPU) way of updating the spins yields slightly diff'erent autocorrelation times than 
the sequential update on CPU. This effect, however, appears to be weak. Apart from the gen- 
eral expected rounded divergence of autocorrelation times in the vicinity of the critical coupling 
= 5 ln(l + ^/2) X 0.44069, Fig. 5 clearly reveals the relative increase of Tint ^ is increased. 
As a result, in the vicinity of /3c there are two competing efifects: increasing k reduces the over- 
head of loading tiles from global memory and thus decreases spin-flip times, whereas at the 
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Figure 6: Re-scaled autocorrelation times fi„t = Tjnifflip for the 128 X 128 Ising system and double checkerboard updates 
on GPU for different values of the multi-hit parameter k. Minimal fi„t corresponds to a minimum time of generating a 
statistically independent spin configuration. 

same time it increases autocorrelation times, thus reducing sampling efficiency. As a result, the 
autocorrelation time in physical units, 

Tint — T"intfflip> 

where fflip denotes the real time for a single spin flip, such that L^fi„i corresponds to the real 
time for generating an effectively uncorrected spin configuration, close to /3c has a minimum at 
intermediate values of k. This is illustrated in Fig. 6, again for a 128 x 128 system with tile size 
T - 16, where it is seen that, in the vicinity of /3c, multi-hit updates with A; ~ 10 appear to be 
optimal. 

3.5. Performance in 2D 

Using the double checkerboard decomposition loading tiles into shared memory and the 
RNGs discussed above, an optimized code was first constructed for the case of the square-lattice 
Ising model. The sample code, which has undergone significant further optimizations as com- 
pared to the version discussed in Ref. [45], can be downloaded from the author's web site [46]. 

It is a well-known trick valid for discrete models to tabulate the exponential occurring in 
the acceptance criterion (2) in advance and use a look-up instead of calculating the exponential 
directly. It appears natural to store this array in constant memory which is fast through caching 
[29]. This memory, however, is optimized for the different threads of a block synchronously 
accessing the same memory element, which is not the typical case here since different spins 
usually have different local interaction energies. It is found more efficient, therefore, to store this 
table in texture memory which is fast with different threads accessing different memory locations. 
While for the CPU code it is beneficial to save arithmetic cycles by drawing random numbers 
and looking up the value of the Boltzmann factor only when A/i > 0, this is not the case on GPU, 
where such an approach results in thread divergence requiring the corresponding branch to be 
visited twice. For the code discussed here, I used a single variable of the high-level language to 
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Figure 7: Spin-flip times in nanoseconds for k = 100 multi-hit updates of the square-lattice Ising model as a function of 
linear system size L. The lines correspond to different block sizes T. 



represent a spin. Performance differences depending on whether int (32 bit), short int (16 
bit) or char (8 bit) variables are used are only very moderate. The results discussed below are 
for int variables. A multi-spin coded version is discussed below in Sec. 5. 

The optimum choice of the size T of tiles depends on system size and is determined by the 
relative resource utilization on the considered GPU device. The corresponding measured spin- 
flip times for k = 100 shown in Fig. 7 are easily understood from the parameters of the Tesla 
CI 060 device at hand: tiles should be large enough to provide a sufficient number of threads in 
each block. The optimum load is seen when each multiprocessor is assigned with the maximum 
number of eight thread blocks. Since each multiprocessor can accommodate a maximum allowed 
number of 1024 threads, this results in 128 threads per block. This number is reached for tiles 
of size 16 X 16 spins (remember that each thread updates two sites). This choice turns out to be 
optimal for lattice sizes L > 256. For smaller lattices, on the other hand, T - 16 does not provide 
enough blocks to load all 30 multiprocessors of the device which requires B^/2 = {L/T)^/2 > 30 
or r < 8 for L = 128 and T < 4 for L = 64. Since T = 4 is the smallest tile size considered, 
this is also optimal for the smallest lattice sizes L = 16 and L = 32. For smaller values of k, 
the performance of global memory accesses comes into play, which favors larger tile sizes due 
to the resulting improvements in coalescence, such that T = 16 turns out to be optimal for any 
L > 32 there. Test simulations for the Fermi architecture on a GTX 480 card lead pretty much 
to the same conclusions: there, the maximum number of resident threads per multiprocessor has 
been increased to 1536, but this increase is too small to lift the optimum tile size to T = 32 such 
that, again, T = 16 is optimal apart from simulations on the smallest lattices. 

Figure 8 shows the performance data for the final code as compared to the scalar CPU im- 
plementation. As a reference CPU 1 used an Intel Core 2 Quad Q9650 at 3.0 GHz, which is one 
of the fastest available CPUs for single-threaded applications. The performance of the latter is 
essentially independent of system size with about 8 ns per spin flip. The maximum performance 
on GPU is reached only for system sizes L > 4096 with about 0.27 ns per spin flip for k - I 
and 0.076 ns per spin flip for k = 100, resulting in speed-up factors ranging from 30 to 105. The 
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Figure 8: Time per spin-flip in the 2D Ising model on CPU and on GPU with difl'erent choices of k. GPU data are for the 
Tesla CI 060 device apart from the lowest curve which is for a GTX 480 card. All data are for tile size T = 16 and an 
inverse temperature fi = 0.4 close to, but off the critical coupling. There is a weak temperature dependence of spin-flip 
times due to differences in the number of write operations. 

Fermi architecture results in an extra speed-up of about 2.3 compared to the Tesla C1060 device, 
leading to a maximum performance of 0.034 ns per spin flip at ^ = 100 or an overall speedup 
factor of 235 (the extra speedup for A; = 1 is even slightly larger, probably due to the reduced 
overhead of kernel calls for the Fermi architecture). Comparing to the previous implementa- 
tion in Ref. [11], one notes that, probably due to inappropriate cache alignment, the (single-spin 
coded) CPU code used there is rather inefficient, such that, e.g., for a L = 4096 system the code 
used here is about 10 times faster than that of Ref [11] and 5 times faster than the results quoted 
in Ref. [23], leading to a rather unfavorable comparison for the CPU based systems in these 
works. Their multi-spin coded CPU version is less than a factor of two more efficient than the 
single-spin coded implementation used here (see below in Sec. 5 for a multi-spin coded version 
of the program used here). The maximum performance of the GPU implementation of Ref. [11] 
with 0.7 ns on the Tesla C1060 is about 2.6 times less than the present k - \ code and a fac- 
tor of 9 slower ai k - 100. The significant speedup at A: = 1 is due to the combined effect of 
the avoidance of bank conflicts in computing the local interaction energy, the fact that periodic 
boundaries do not need to be taken account with expensive modulo operations once the tile in- 
cluding its boundary layer has been loaded into shared memory, and the efficient implementation 
of the Metropolis criterion using texture fetches. The multi-spin coded GPU code (shared mem- 
ory version) of Ref. [23] is about of the same performance as our single-spin coded version at 

In a practical application measurement cycles are required on top of the updates, of course. 
Energy measurements were here realized with tracking the local energy changes at each spin flip 
and combining them in a final parallel reduction step after the update. At A: = 100, corresponding 
to only one measurement per one hundred lattice sweeps, the overhead is small, of the order of 
10%, see the data collected in Fig. 9. At A: = 1, the overhead is about 20%. In view of the potential 
problems with the quaUty of random numbers discussed above in Sec. 3.3, it is worthwhile to 
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Figure 9: Spin-flip times in nanoseconds for flie 2D Ising GPU code employing difi"erent random number generators and 
showing the efl'ect of including energy measurements. All runs for k = 100. 



check the influence of the different speeds of RNGs discussed above on the overall performance 
of the Ising simulation code. In a simulation as simple as the Metropolis update of the Ising 
model, a substantial fraction of time is spent for generating random numbers. Therefore, it is 
no surprise that it is found that the simulation is slower by a factor of 1.73 when using the 
LCG64 instead of the LCG32 generator and by a factor of 2.63 when using the lagged Fibonacci 
generator ik - 100, L-\6 384). In view of the good results for the randomly initialized LCG32 
reported in Sec. 3.3, it is probably acceptable to use it for the considered model in view of its 
good performance. For other applications, however, this issue would need to be revisited. 

3.6. Performance in 3D 

The generalization of the GPU code from two to three dimensions is straightforward, and only 
the part responsible for the correct bookkeeping in loading a tile into shared memory becomes 
more tedious to formulate. The range of allowable tile sizes is somewhat more restricted by the 
limitation of the available GPUs to 5 12 threads per block (resp. 1024 for Fermi). With the current 
setup, therefore, the only tile sizes that can be reasonably considered are T - A and T = 8, if one 
restricts oneself to powers of two in system as well as in tile sizes. More general and, perhaps, 
versatile, setups could be reached by relaxing the power-of-two condition or considering tiles 
with non-unit aspect ratios, which might lead to additional slight increases in performance. For 
the sake of simplicity, however, I here refrain from introducing such complications. 

The timing results for the 3D code are summarized in Fig. 10. The performance of the CPU 
code is essentially independent of lattice size within the considered range of L = 8 up to L = 512, 
arriving at around 14 ns per spin flip on the reference CPU used here, which is slower than the 
2D code by a bit more than the factor of 6/4 - ijl expected merely from the increased number 
of nearest neighbors to be included in energy calculations. For the GPU code, we find rather 
comparable performance from T - A and T - % with a slight advantage for the latter for lattice 
sizes exceeding L - 64. The maximum performance on the Tesla CI 060 is around 0.13 ns (at 
k = 100), about by the expected factor of 2/3 less than in two dimensions. The Fermi card is 

17 




Figure 10: Timings per spin flip for the 3D Ising CPU code on tlie Tesla CI 060 and GTX 480 (Fermi) devices as 
compared to tlie reference CPU implementation. 



faster by another factor of 1.9, resulting in a peak performance of 0.067 ns per spin flip. As 
a consequence, the observed maximal speed-up factors are 110 for the CI 060 and 210 for the 
GTX 480. These speed-ups are very close to those observed in two dimensions, such that the 
relative efficiency of the CPU and GPU implementations are practically identical in two and three 
dimensions. 



4. Continuous spins 

A natural generalization of the Ising model is to systems with continuous degrees of freedom, 
i.e., 0(n) spin models with Hamiltonian 

-K^ -7 2 = (10) 

('■,» 

which are, most often, more realistic for the description of real magnetic materials. For im- 
plementations of simulation algorithms on GPU, this off'ers the opportunity to test the relative 
performance of floating-point operations, an application for which GPUs are highly optimized 
by design, since their original purpose in rendering graphics in technical terms to a large percent- 
age reduces to manipulations of matrices of floating-point numbers. 

Since for the original graphics purposes calculations in double precision usually are not rel- 
evant, traditionally GPU devices have only offered single-precision arithmetic. Although double 
precision calculations could be emulated in software, native support for double precision arith- 
metic was only introduced with the GT200 architecture, but it came with a massive performance 
penalty, double precision being around eight times slower than single precision. Only with the 
recent Fermi architecture, this drawback has been partially removed, double precision calcula- 
tions there being only around a factor of two slower than single precision arithmetic. Sufficient 
precision is certainly important for numerical algorithms that have a potential of accumulating 
rounding and finite-precision errors such as, e.g., molecular dynamics simulation codes. For 
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Figure 11: Timings per spin flip for the 2D Heisenberg model simulations on the Tesla CI 060 and GTX 480 (Fermi) 
devices as compared to the reference CPU implementation. The data are for L = 4096 and k = 100. 

Monte Carlo simulations, on the other hand, it is rather obvious that due to the random sampling 
rounding errors and numerical accuracy are much less of an issue, and it might be acceptable 
to represent the degrees of freedom in single precision and use double precision only for ac- 
cumulated quantities in measurements. Implementations in single and in double precision are 
benchmarked here, also from the point-of-view of the resulting accuracy. 

In addition to the question of the number of digits, the floating-point implementation on 
CUDA devices is not completely compliant with the IEEE-754 floating-point standard which is 
otherwise practically universally accepted for current computing platforms. In particular, de- 
vices prior to the Fermi series do no support the so-called denormalization feature (for single- 
precision calculations) that ensures sufficient accuracy for numbers close to zero. Furthermore, 
due to device-specific implementations, the accuracy of operations such as division and square 
root is sometimes less than the 0.5 units in the last place (ULP) prescribed by the IEEE stan- 
dard. Additionally, there are some less dramatic deviations such as the absence of a signaling of 
floating-point exceptions, the lack of configurable rounding modes and others. The mentioned 
limitations are removed with the advent of Fermi cards, which can be configured to almost com- 
pletely adhere to the IEEE standard. Regarding transcendental functions such as exponential, 
sine and logarithm, device-specific and highly performant, but less accurate, versions are pro- 
vided in the CUDA framework in addition to the standard C library implementations that are 
also available [14]. (Note, however, that these highly performant implementations are only avail- 
able in single-precision variants.) Apart from these questions of adherence to the floating-point 
standard and whether single-precision calculations are sufficient for the purpose at hand, one 
must keep in mind that parallel algorithms in themselves might lead to finite-precision effects 
which are different from those seen in serial algorithms. If for instance, a binary-tree algorithm 
is used to perform a parallel reduction over an array of values to evaluate their sum (as in the 
energy measurement code of the algorithms discussed here), the result will be slightly different 
to that found in a serial calculation since in finite precision addition is not commutative. Such 
effects can be reduced, however, with appropriate adaptations, for instance sorting the array by 
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device 


mode 


hip [ns] 


speedup 


CPU (Intel Q9650) 


float or double 


183.16 


1 




float 


0.74 


247 


TeslaC1060 


float, fastjnath 


0.30 


608 




double 


4.66 


39 




float 


0.50 


366 


GTX 480 


float, fastjnath 


0.18 


1029 




double 


1.94 


94 



Table 2: Times per spin flip in ns for various implementations of single-spin flip simulations of the 2D Heisenberg model 
and speedup factors as compared to the CPU reference implementation. All data are for multi-hit updates with k = 100 
and system size L = 4096. 

size prior to the reduction, such that values of similar magnitude are added first. Such consider- 
ations are not specific to GPUs, of course, but need to be taken into account whenever parallel 
algorithms come into play. 

To study the efl'ects of these particularities of the massively parallel GPU architecture on the 
performance and accuracy of a Monte Carlo simulation, a Metropolis single-spin update simu- 
lation code was benchmarked for the case of the Heisenberg model. For the sake of simphcity, 
a new spin orientation is proposed uniformly at each update step, independent of the previous 
orientation of the spin. In practice, this leads to poor acceptance rates at low temperatures which 
could be improved by proposing small modifications instead of uniform re-orientations. For the 
benchmarks considered here, however, this question is irrelevant, since local reorientations lead 
to almost identical results for the relative performance of the different implementations consid- 
ered. In the standard implementation, drawing a random unit vector in three dimensions requires 
two random numbers and a total of five trigonometric function calls. In addition, there are 6d 
multiply-and-add operations for calculating the scalar products in determining the local energy 
change in an attempted fiip for simulations on a d-dimensional (hyper-)cubic lattice. Compared 
to the Ising system, the exponentials in the Metropolis criterion (2) cannot be tabulated in ad- 
vance, leading to an extra special-function call per attempted update. As a result of these differ- 
ences, an update step for the Heisenberg model is significantly heavier with arithmetic operations 
than an Ising update which, however, from the point-of-view of GPU computing is a rather fa- 
vorable situation since this reduces the probability of memory bandwidth limitations. (Note, 
however, that Heisenberg updates also require larger memory transfers since each spin variable 
requires three floating-point numbers.) 

Figure 1 1 compares the resulting spin-flip times for various GPU and CPU implementations 
and a system size of 4096 x 4096 spins. The CPU implementation has exactly the same perfor- 
mance for single and double precision implementations, which is plausible since most arithmetic 
units in current CPUs are double precision anyway. For single-precision calculations, the use of 
the intrinsic, fast implementations of the special functions results in a factor of 2.5 speedup of the 
entire code, while the use of double variables for representing spins results in an about sixfold 
slow-down for the Tesla CI 060 and a fourfold slow-down for the Fermi card. Overall speed- 
up factors are rather impressive, ranging from 39 for the double-precision version on the Tesla 
C1060 and 1029 for the single-precision code with fast special functions on the Fermi board, cf. 
the data collected in Tab. 2. It should be noted, of course, that reduced-precision implementa- 
tions of special functions are, in principle, also possible on CPU architectures, such that the extra 
speed-up gained by the use of these functions is not specific to using GPUs. 
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Figure 12: Average deviation of tiie spin normalization from unity in Heisenberg model simulations employing single- 
precision floating point arithmetic on CPU and on GPU as a function of the number of lattice sweeps. 



To see whether numerical stability and finite precision are relevant for the calculations con- 
sidered here, deviations from the required normalization \Si\ - 1 have been monitored for the 
different implementations. As can be seen from the results presented in Fig. 12, a drift of nor- 
malization is not an issue here. As expected, deviations of \Si\ from unity are of the order of 
10 for single-precision implementations uniformly for CPU as well as GPU implementations. 
Maybe surprisingly, deviations are even somewhat smaller for the GPU codes than for the CPU 
implementation. Using the fast intrinsic implementations of special functions, on the other hand, 
reduces the accuracy by about one digit, such that deviation are of the order of 10"^ there. Es- 
timates of the magnetization, internal energy and specific heat extracted from the simulation 
runs also have been compared between the different implementations and, at the given level of 
statistical accuracy, no significant deviations have been observed. (Note that reductions in the 
measurement cycles have been implemented in double precision.) It appears justified, therefore, 
to claim that for spin model simulations a representation of the spin variables in single precision, 
while still doing measurements in double precision, does not introduce systematic deviations of 
relevance as compared to statistical fluctuations. The same seems to be valid even when using 
the fast intrinsic, reduced accuracy implementations of the special functions. In view of the dra- 
matic speed-ups observed in particular for the "float, fastjnath" implementation, cf. Tab. 2, this 
appears encouraging. 



5. Spin glass 

Computationally particularly demanding problems in the field of spin systems occur for mod- 
els with quenched disorder [19]. The nature of fluctuations between disorder realizations typi- 
cally requires averages over at least several thousand disorder configurations to enable reliable 
results. For the case of spin glasses, where disorder is accompanied by the presence of compet- 
ing interactions, each disorder configuration exhibits extremely slow dynamics close to or below 
the spin-glass transition. Since no effective cluster algorithms are available for spin glasses (with 
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Figure 13: Spin-flip times in ns for simulating Wjcpiica independent disorder realizations of a 16 X 16 Ising spin glass 
system on the GT200 and Fermi architectures as compared to the CPU implementation. 

the exception of systems in two dimensions [21]), simulations are restricted to single-spin up- 
dates, and the combined effect of disorder average and slow relaxation results in rather excessive 
computational requirements. It is not surprising, therefore, that a number of special-purpose 
computers have been constructed for such problems, the most recent being the Janus machine 
based on FPGAs [7]. 

The trivial parallelism inherent in the disorder average and the necessary restriction to local 
spin updates appears to render spin-glass problems ideal candidates to profit from the massively 
parallel architecture of GPUs. To see to which degree such hopes are borne out in real implemen- 
tations, 1 studied the short-range Edwards- Anderson Ising spin glass model with Hamiltonian 

^^-Y^JijSiSj, Si^±\, (11) 

(iJ) 

where the exchange couplings Jij are quenched random variables, here drawn from a symmetric 
bimodal distribution PiJij) - \S(Jij - J) + \5{Jij + J)- Compared to the ferromagnetic Ising 
model (1), the disordered version (11) requires the values of the couplings Jjj to be taken into 
account when calculating the energy change of a spin flip such that an additional data structure 
for the couplings becomes necessary. For the tiled algorithm described in Sec. 3.2 above, the 
values of inside of each tile need to be loaded into shared memory in addition to the spin 
field. Furthermore, a number A^repiica of completely independent instances of the spin system 
with different coupling values can be simulated in parallel. These different instances are mapped 
to different thread blocks in the GPU architecture. Apart from these changes, the algorithm is 
identical to the implementation described for the ferromagnetic model. 

The spin-flip times resulting from this implementation are displayed in Fig. 13 as a function 
of the number A^iepUca of disorder realizations simulated simultaneously. The reference CPU 
implementation performs at around 14.6 ns per spin flip on the Intel Q9650 CPU used here, 
essentially independent of the number of realizations as those are merely worked on in sequence 
there. The GPU code reaches a maximum performance of 0.15 ns on the Tesla C1060 and 0.07 
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Figure 14: Spin-flip times in ns for simulations of the Ising spin-glass using asynchronous multi-spin coding with either 
iv = 32 or w = 64 bit variables as a function of the number of disorder realizations. Simulations are for a 16 X 16 system, 
comparing CPU and GPU codes on various devices. 

ns on the Fermi card, respectively, corresponding to speed-ups of 95 (CI 060) and 208 (GTX 
480). These speed-ups are practically identical to the maximum speed-ups observed for the 
ferromagnetic model, whereas in absolute terms due to the extra calculations related to the non- 
uniform couplings the CPU and GPU implementations for the spin-glass model are about a factor 
of two slower than the ferromagnetic codes. 

It is well known that storing the inherently one-bit information of the orientation of an Ising 
spin in a (typically) 4-byte integer variable is a wasteful operation and packing several spin vari- 
ables in each machine word can result in significant performance improvements [47]. While for 
the case of ferromagnetic models the only possible implementation needs to pack a number of 
spins from the same lattice into one word of w bits, leading to a somewhat involved sequence 
of bit-wise operations becoming necessary in order to perform the Metropolis updates of the 
coded spin in parallel [47] ("synchronous" multi-spin coding [7]), the situation in the presence 
of quenched disorder implies that one can also code w spins in the same spatial location of 
independent Ising lattices corresponding to different disorder configurations in one word ("asyn- 
chronous" multi-spin coding [7]). This setup allows for very efficient implementations using 
only a few bit-wise operations. Depending on the type of integers used, currently w = 32 or 
w = 64 spins can be represented in one variable. In parallel, 32 or 64 values of bimodal cou- 
plings of different realizations are stored in another field of integers. The resulting performance 
characteristics are shown in Fig. 14. The CPU code performs at 0.41 ns per spin flip for w = 32 
and a significantly better 0.18 ns for w - 64, corresponding to a speedup of 35 and 79 as com- 
pared to the single-spin coded CPU version for w = 32 and w - 64, respectively^. The GPU 
implementation performs at around 6.6 ps for 32-bit and 7.5 ps for 64-bit on the Tesla C1060 
card. The rather disappointing result for the w = 64 version shows that the GT200 architecture 



^Note that here I am using the same random number to update each of the w = 32 or w = 64 spins, which appears 
justified in view of the statistical independence of the coupling realizations. This setup was previously tested for the Ising 
spin glass in three dimensions and no significant coiTelation effects were observed [48]. 
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is not optimized for 64-bit operations. On the GTX 480 Fermi card, on the other hand, the 64-bit 
multi-spin coded implementation runs as fast as 2.3 ps per spin flip as compared to 3.4 ps for the 
32-bit version. The resulting optimal speed-ups compared to the CPU code are hence around 30 
for the Tesla C1060 and around 80 for the GTX 480. For the Fermi card, the resulting code runs 
around 55 times faster than the (synchronous) multi-spin coded simulation of the ferromagnetic 
model discussed in Ref. [23]. It is worthwhile to note that Janus performs at around 16 ps per spin 
flip on the 3D Ising spin glass [7]. Circumstances forbid a clear-cut comparison here, however, 
since the Janus code uses synchronous multi-spin coding and a somewhat better random-number 
generator. Still, it appears fair to say that an appropriate GPU implementation seems to play in 
the same league as the special-purpose computer Janus at a much smaller investment in capital 
and human resources. 

6. Parallel tempering 

A rather powerful technique for the simulation of systems with complex free energy land- 
scapes, including spin glasses and bio-polymers, is given by the parallel tempering or replica 
exchange algorithm [5, 49]. There, a number of identical copies of a system are simulated at 
a set of different, closely neighboring temperatures. In addition to any chosen intra-replica up- 
date, such as the single spin flips according to the Metropolis criterion (2), at fixed intervals an 
interchange of configurations (typically) simulated at neighboring temperatures is attempted and 
accepted with the generalized Metropolis probabiUty 

P..c{{si},l3 ^ {s\},l3') = min[l, e^^^j , (12) 

where AS = £({«■}) - £({«,}) and Afi = ji' - p. With a properly chosen set of temperatures [50- 
52], this additional update aUows for configurations with slow dynamics at low temperatures to 
successively difluse to high temperatures, decorrelate there and finally return back to the low- 
temperature phase, Ukely arriving in a difi'erent vaUey of a complex free-energy landscape than 
where it started from. 

Through its inherent parallelism of intra-repUca updates, this scheme appears to be weU 
suited for the massively parallel architecture of current GPUs. It is tested here in a reference im- 
plementation for the ferromagnetic Ising model along the Unes of the previously discussed single 
spin-flip code. The additional copies of the system are mapped to additional thread blocks run- 
ning in parallel. Since the configurational energy is calculated online from the energy changes 
of single spin flips via a binary-tree reduction algorithm on tiles, it is automatically available 
after lattice sweeps and hence its determination does not incur any extra overhead. In the current 
implementation, replica exchange steps are performed on CPU since the computational effort is 
low and synchronization between thread blocks is required. As usual, instead of exchanging con- 
figurations between replicas only the corresponding temperatures are swapped. The Boltzmann 
factors according to (2) can stiU be tabulated, and are now implemented on GPU as fetches from 
a two-dimensional texture. A successful replica exchange then requires an update of the texture 
which is easily possible in the current setup as the exchange moves are carried out on CPU. 

The benchmark results of the parallel tempering simulation of the 2D Ising model on GPU 
and CPU is shown in Fig. 15. I chose to use n = 20 repUcas at equally spaced inverse tempera- 
tures in the interval j8 = [0.1, 0.15]. In a real-world application one would probably want to use a 
more elaborate scheme for choosing these temperatures [50, 51], but these questions do not have 
any bearing on the benchmark calculations performed here. As is clearly visible, the presence 
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Figure 15: Spin-flip times for a pai'allel-tempering simulation of the 2D Ising model as a function of system size. Sim- 
ulations were performed on the range of inverse temperatures j8 = [0.1, 0.15] with n = 20 replicas with one exchange 
attempt per 100 sweeps of spin flips. 



of additional copies of the system leads to a much better resource utilization for smaller system 
sizes than that observed for the single-spin flip simulations. Hence, significant speed-ups are 
observed already for small systems. The CPU code performs at a constant 1 1 ns per combined 
spin-flip and replica exchange move (mixed at a ratio of one exchange move per one hundred 
lattice sweeps). The GPU code arrives at a maximum of around 0.089 ns for the Tesla C1060 
and 0.057 ns for the GTX 480 at L = 2048. The speed-up reaches up to 130 (CI 060) resp. 200 
(GTX 480) at L = 2048, but is already 109 (C1060) resp. 151 (GTX 480) for the L = 64 system. 
As for disordered systems due to the severe slowing down usually only rather small systems 
can be studied, good performance of the code is crucial in this regime. Increasing the number 
of exchange moves to one in ten lattice sweeps reduces the maximum performance of the GPU 
implementation somewhat to 0.1 1 ns (C1060) and 0.070 ns (GTX 480), respectively. 



7. Conclusions 

Current GPUs have a significant potential for speeding up scientific calculations as compared 
to the more traditional CPU architecture. In particular, this applies to the simulation of systems 
with local interactions and local updates, where the massive parallelism inherent to the GPU de- 
sign can work efficiently thanks to appropriate domain decompositions. The simulation of lattice 
spin systems appears to be a paradigmatic example of this class as the decomposition remains 
static and thus no significant communication overhead is incurred. Observing substantial speed- 
ups by factors often exceeding two orders of magnitude as for the case studies reported here 
requires a careful tailoring of implementations to the peculiarities of the considered architecture, 
however, in particular paying attention to the hierarchic organization of memories (including 
more exotic ones such as texture memory), the avoidance of branch divergence and the choice of 
thread and block numbers commensurate with the capabilities of the cards employed. For achiev- 
ing good performance, it is crucial to understand how these devices hide the significant latency 
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of accessing those memories that do not reside on die through the interleaved scheduUng of a 
number of execution threads significantly exceeding the number of available computing cores. It 
is encouraging that employing such devices with the rather moderate coding eff'ort mediated by 
high-level language extensions such as NVIDIA CUDA or OpenCL updating times in the same 
ballpark as those of special purpose machines such as Janus [7] with a vastly higher development 
effort can be reached. 

A regularly uttered objection against the systematic use of GPUs for scientific computing 
criticizes them as being a possibly too special and exotic architecture with unknown future avail- 
ability and architectural stability as compared to the traditional and extremely versatile x86 CPU 
design. While there is certainly some truth to this argument, there can be no doubt about the fact 
that massive parallelism is not only the present state of the GPU architecture, but also the future 
of CPU based computing. As of this writing, Intel CPUs feature up to eight cores and AMD 
chips up to twelve cores per die, the corresponding road-maps projecting even significantly more 
cores in the foreseeable future. Supercomputers will soon count their number of cores in the 
millions. Due to this development, serial computing will not remain a viable option for serious 
computational science much longer. Much effort will need to be invested in the years to come 
into solving scientific problems employing a massive number of parallel computational units. 
In this respect, GPU computing, apart from currently being more efficient for many problems 
in terms of FLOP/s per Watt and per Euro than CPU based solutions, is rather less exotic than 
pioneering. 

An ideal application for GPU computing in the field of the simulation of spin systems appear 
to be disordered systems, where cluster algorithms are in general not efficient and a natural 
parallelism occurs from the quenched average over disorder, possibly combined with the parallel 
tempering algorithm. Using asynchronous multi-spin coding for the Ising spin glass, spin flip 
times down to 2 ps can be achieved. Systems with continuous spins are particularly well suited 
for GPU deployment, since one finds a relatively stronger overhead of arithmetic operations over 
memory fetches as compared to systems with discrete spins there. For the Heisenberg model, 
speed-ups up to a factor of 1000 can be obtained when making use of the highly optimized 
special function implementations in single precision. While these examples of single-spin flip 
simulations look rather promising, it is clear that other well-established simulation algorithms for 
spin systems are less well suited for the parallelism of GPUs, including the cluster algorithms in 
the vicinity of critical points, where spin clusters spanning the whole system need to be identified, 
or multi-canonical and Wang-Landau simulations, which require access to the current values of 
a global reaction coordinate (such as, e.g., the energy or magnetization) for each single spin 
flip, thus effectively serializing all spin updates. It is a challenge for future research to find 
implementations or modifications of such algorithms suitable for massively paraUel computers 
[20]. 
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